perm filename TALK[KI,ALS] blob sn#094478 filedate 1974-03-28 generic text, type T, neo UTF8
This is in the nature of a progress report rather than any sort of
finished paper describing a completed piece of work. We at Stanford
have felt that some work was needed on what might be called the front
end of a speech understanding system, to bring into balance the
over-all effort on speech understanding that is currently being
sponsored by ARPA.

In the early 60's it was quite fashionable to investigate speech on a
pitch-synchronous basis, this in spite of the fact that the facilities
for doing so were quite primitive as compared with those that we have
today. The work of Mathews, Miller and David can be cited as an
example. The computational complexities of this approach using Fourier
analysis led some to attempt to obtain similar results by direct
analysis in the time domain. The work of Pinson can be referenced as
an example. This led to the development of a variety of so-called LPC
methods, which more recently have been shown to be essentially
equivalent. With the current availability of specialized fast
auxiliary hardware, the problem of doing Fourier transforms no longer
seems to pose the difficulty that it once did, and it seems desirable
to once again go back to the general methods of Mathews, Miller and
David. Feeling that this is currently a neglected phase, we have
devoted considerable effort to it.
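
The LPC methods just mentioned can be made concrete with a short
sketch. What follows is a modern illustrative reconstruction, not any
of the implementations cited above: it derives linear-predictive
coefficients for a single analysis frame (ideally one pitch period) by
the autocorrelation method and the Levinson-Durbin recursion. The
function names, frame, and model order are assumptions of the sketch.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation of one analysis frame for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def lpc(x, order):
    """Linear-predictive coefficients by the autocorrelation method,
    solved with the Levinson-Durbin recursion.  Returns (a, err) where
    a[0] == 1.0 and the all-pole spectral model is 1 / A(z) with
    A(z) = a[0] + a[1] z^-1 + ... + a[order] z^-order."""
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this stage of the recursion.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err
```

As a first-order check, the decaying exponential x[n] = 0.9^n gives
a[1] close to -0.9, i.e. the predictor x[n] is approximately
0.9 x[n-1].
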
A few remarks regarding the long-term effort in speech recognition
will, I think, restore a sense of perspective. When the current ARPA
projects were started there had been a long history of continued study
of the mechanisms of speech production and recognition by such
organizations as the Bell Telephone Laboratories and Haskins
Laboratories, to name but two of a long string of organizations which
have done and continue to do good work. Some of this work is far from
new. I well remember the state of the art in 1928 when I joined the
Bell Laboratories and became acquainted with Harvey Fletcher and his
associates. In spite of this long-continued effort at basic
understanding, as of roughly three years ago the practical results in
terms of operating speech recognition systems were essentially nil.
Oh, it was true that there had been many demonstrations of word
recognition, perhaps the most successful one being that by Reddy and
Vicens at Stanford, which incidentally was supported by ARPA.
Nevertheless, it had become apparent that a continuation of this same
brute-force attack on speech recognition without understanding was
more or less a blind alley, and that a rather drastic infusion of new
ideas, and incidentally of new money, was needed if the machine
recognition of continuous speech was ever to become a reality.

As you all know, several major projects were initiated as of that
time, and many of these have been or are to be reported at this
conference. With this massive infusion of new talent, and with the
loss of our major workers at Stanford through the departure of Raj
Reddy and several of his students, we were left with no long-term
workers in this field, but with the facilities to do speech work and
with some students still working on their degrees. I became interested
in the field and was faced with the problem of how we could continue
to do useful work in the speech field with inadequate financial
support and a very small group of people. In surveying the field it
seemed to me that most of the workers had rather assumed that work on
the front end had reached the state of diminishing returns and that
all of their effort should be directed to other aspects. With the
danger of the pendulum swinging too far in this direction, we decided
to direct our efforts exclusively to the front end.

Two of the more important contributions that we have made have been
reported separately at this conference and will not be described in
detail in this talk.

Let me begin by outlining some of the ways in which continued work on
the acoustic end can contribute toward the realization of a more
effective overall system.

The first problem is that of isolating those portions of the incoming
acoustic wave that warrant special attention. Speech is a highly
redundant process. Some of this redundancy is introduced simply
because of the biological limitations of our vocal tract, some because
of limitations of the ear as a transducer, but some of the redundancy
performs the very useful function of compensating for these very
limitations and of making it possible to transmit intelligence by
speech in the presence of background noise and of distortions in
transmission. Our problem is to separate these various aspects, to
retain useful redundancies, and to reduce the amount of information
that is left for processing at as early a stage as possible. As
computer scientists, we have at our command certain analytical tools
that do not have direct counterparts in the human speech recognition
channel. We can therefore safely ignore some aspects of the incoming
wave. At the same time, the very great computational speed at our
disposal makes it possible for us to retain a certain amount of
redundancy to make the system robust. Much of our work at Stanford has
centered around this aspect of the problem. We have concentrated our
efforts on extracting as much information from the wave as possible
while still in the time domain, and at the other extreme we have
explored mechanisms for using redundancies in the interest of
robustness.

The work in the time domain has been adequately covered in other
papers. I will simply show two illustrations taken from these papers.
The first illustration shows the present performance of an acoustic
segmenter, designed to work in real time and to delineate those
regions of the wave form that can be thought of as being essentially
steady state and those displaying the maximum amount of transition.
Both regions are thought to be of value in aiding recognition.
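
Since the segmenter itself is described in the separate papers, the
following is only an illustrative sketch of the kind of decision
involved, and is a reconstruction rather than the Stanford program:
each frame is reduced to two cheap time-domain parameters, log energy
and zero-crossing rate, and a frame is labelled a transition when
either parameter changes appreciably from the previous frame, steady
state otherwise. The frame length, hop, and thresholds are arbitrary
illustrative values.

```python
import math

def frame_params(x, frame_len=160, hop=80):
    """Reduce a waveform (list of floats) to per-frame time-domain
    parameters: log10 energy and zero-crossing rate."""
    params = []
    for start in range(0, len(x) - frame_len + 1, hop):
        f = x[start:start + frame_len]
        energy = math.log10(sum(s * s for s in f) + 1e-10)
        zcr = sum(1 for a, b in zip(f, f[1:]) if (a < 0) != (b < 0)) / len(f)
        params.append((energy, zcr))
    return params

def segment(params, eps_energy=0.1, eps_zcr=0.05):
    """Label each frame 'steady' or 'transition' by the change in its
    parameters relative to the preceding frame."""
    labels = ['steady']
    for (e0, z0), (e1, z1) in zip(params, params[1:]):
        moving = abs(e1 - e0) > eps_energy or abs(z1 - z0) > eps_zcr
        labels.append('transition' if moving else 'steady')
    return labels
```

A fixed sine wave comes out all steady state, while a sudden change of
amplitude or frequency shows up as transition frames at the boundary.
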

The second illustration has to do with the location of pitch marks,
marking the zero crossings which precede the maximum excursions of the
wave form. We have found that Fourier transforms or LPC transforms
which are based on a single period of the input wave are most
revealing as to the configuration of the vocal tract at the time,
without the complications introduced by glottal interaction.
Undoubtedly glottal interaction effects have a great deal to do with
those speaker-specific characteristics which allow us to identify the
speaker quite irrespective of what he is saying, but it is our belief
that they have little or nothing to do with understanding.
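
The pitch-mark rule described above (a mark at the zero crossing
preceding each maximum excursion) can be sketched as follows. This is
an illustrative reconstruction under simplifying assumptions, not the
actual program: peaks are taken as local maxima above half the global
maximum, only positive excursions are considered, and a fixed minimum
spacing stands in for a real pitch estimate.

```python
def pitch_marks(x, min_period=40):
    """Place a mark at the upward zero crossing that precedes each
    major positive peak of the waveform x (a list of floats).  Peaks
    are local maxima above half the global maximum, at least
    min_period samples apart."""
    peak_level = 0.5 * max(x)
    marks, last = [], -min_period
    for i in range(1, len(x) - 1):
        is_peak = x[i] >= peak_level and x[i - 1] < x[i] >= x[i + 1]
        if is_peak and i - last >= min_period:
            j = i
            # Walk back to the nearest upward zero crossing.
            while j > 0 and not (x[j - 1] <= 0.0 < x[j]):
                j -= 1
            marks.append(j)
            last = i
    return marks
```

A single-period Fourier or LPC transform would then be taken over the
samples between successive marks.
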